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Abstract. We present a novel probabilistic clustering model for objects 
that are represented via pairwise distances and observed at different time 
points. The proposed method utilizes the information given by adjacent 
time points to find the underlying cluster structure and obtain a smooth 
cluster evolution. This approach allows the number of objects and clus¬ 
ters to differ at every time point, and no identihcation on the identities 
of the objects is needed. Further, the model does not require the num¬ 
ber of clusters being specified in advance - they are instead determined 
automatically using a Dirichlet process prior. We validate our model on 
synthetic data showing that the proposed method is more accurate than 
state-of-the-art clustering methods. Finally, we use our dynamic cluster¬ 
ing model to analyze and illustrate the evolution of brain cancer patients 
over time. 


1 Introduction 

A major challenge in data analysis is to find simple representations of the data 
that best reveal the underlying structure of the investigated phenomenon Ha- 
Clustering is a powerful tool to detect such structure in empirical data, thus 
making it accessible to practitioners |14] . The problem of clustering has a very 
long history in the data mining and machine learning communities, and numer¬ 
ous clustering algorithms and applications have been studied in many different 
scientific disciplines over the past 50 years m- Applications of clustering include 
a large variety of problem domains as, for example, clustering text, social net¬ 
works, images, or biomedical data 14110121128) . Traditional clustering methods 
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such as A:-means or Gaussian mixture models [12] . rely on geometric representa¬ 
tion of the data. Nowadays, however, increasingly often there is no access to an 
underlying vectorial representation of the data since only pairwise similarities 
or distances are measured. An example application domain where such a setting 
frequently occurs is biomedical data analysis, where more often than not only 
pairwise distance data is available, e.g., when DNA or protein sequences are 
represented as pairwise distances or string alignments [1)11612:1125126] . 

Although many clustering methods exist that work on distance data, including 
single linkage clustering, complete linkage clustering, and Ward’s clustering El, 
these methods are static methods that are innocuous with respect to a potentially 
underlying time structure. However, when data is obtained at different points 
in time, dynamic models are needed that take a time component into account. 
For example in cancer research, genes are frequently measured at different time 
points, in order to examine the efficiency of a medication over time. In Network 
Security, HTTP connections are recorded at various timestamps, since network 
behaviors can quickly change over time; in Computer Vision, video streams con¬ 
tain time-indexed sequence of images. To deal with such scenarios, dynamic 
models that take the evolving nature of data into account are needed. Such a 
requirement has been addressed with evolutionary or dynamic clustering models 
for vectorial data (as for instance in or |33jl. which obtain a smooth 

clustering over multiple time points. However, to the best of our knowledge, no 
time-evolving clustering models exist that work on distance data directly, and 
clustering of time-evolving distance data is still an unsolved problem. 

In this work we will bridge this gap and present a novel Bayesian time-evolving 
clustering model based on distance data directly that is specially tailored to 
temporal data and does not require direct access to an underlying vector space. 
Our model will be able to detect cluster popularity over time, based on the rich 
gets richer phenomenon. We will be able to make predictions about how popular 
a cluster will be at time t -|- 1 if we already knew that it was a rich cluster at 
time point t. The assumption that rich clusters get richer seems plausible in 
many domains, for instance, a hot news topic is likely to stay hot for a given 
time period. Our model is also able to cope with variability of data size: the 
number of data points may vary between time points, for instance, data items 
may arrive or leave. Also, the number of clusters may vary over time and the 
model is able to adjust its capacity accordingly, and automatically. The aim is to 
find the underlying structure at every time point and to obtain a smooth cluster 
evolution which results in an easy interpretable model. Thereby the information 
shared across neighboring time points is related to the size of the clusters, the 
time-varying property of the clusters is assumed to be Markovian, and Markov 
Chain Monte Carlo (MCMC) sampling is used for inference. 

The presented method is also applicable for the less general case of pairwise 
similarity data, by using a slightly altered likelihood. Since Mercer kernels can 
encode similarities between many different kinds of objects (for instance kernels 
on graphs, images, structures or strings) the method proposed here can in par¬ 
ticularly cover the entire scope of applications of kernel-based learning, be it 
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string alignment kernels over DNA or protein sequences |1612,3126) or diffusion 
kernels on graphs [50] , 


We validate our approach by comparing it to baseline methods on simulated 
data where our new model significantly outperforms state-of-the-art clustering 
approaches. We apply our novel model to a highly topical and challenging real 
world data set of brain cancer patients from Memorial Sloan Kettering Cancer 
Center (MSKCC). This data consists of clinical notes as part of electronic health 
records (EHR) of brain cancer patients over 3 consecutive years. We model brain 
cancer patients over time where patients are grouped together based on the sim¬ 
ilarity of sentences in the clinical notes (see Section 4.2 1 . 


2 Background 

In this section we recap important background knowledge which is essential for 
the remainder of this paper. 

Partition Process: Let B„ denote a set of partitions of [n], and [n] := {1,..., n} 
denote an index set. A partition B G B„ is an equivalence relation B : [n]x [n] —>■ 
{0,1} with B{i,j) = 1 if y{i) = y{j) and B{i,j) = 0 otherwise, y denotes a 
function that maps [n] to some label set L. Alternatively, B may be represented 
as a set of disjoint non-empty subsets called “blocks”. A partition process is a 
series of distributions Pn on the set B„ in which is the marginal distribution of 
Pn+i - This means, that for each partition B G B„+i, there exists a corresponding 
partition B* G B„ which is obtained by deleting the last row and column from 
the matrix B. The properties of partition processes are in detail discussed in |19j . 
Such a process is called exchangeable if each is invariant under permutations 
of object indices, see [52] for more details. An example for the partition lattice 
for B 3 is shown in Fig. 


123 1|23 13|2 12|3 1|2|3 


1 block 2 blocks 3 blocks 

Fig. 1. Partition lattice for B 3 . 


Gauss-Dirichlet Cluster Process: The Gauss-Dirichlet cluster process consists of 
an infinite sequence of points in together with a random partition of integers 
into k blocks. A sequence of length n can be sampled as follows mm-- fix the 
number of mixture modes k, generate mixing proportions tt = (tti, ... ,TTk) from 
a symmetric Dirichlet distribution Dir(^/fc,... ,^/fc), generate a label sequence 
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{y(l),..., y(n)} from a multinomial distribution and forget the labels introduc¬ 
ing the random partition B of [n] induced by y. Integrating out tt, one arrives 
at a Dirichlet-Multinomial prior over partitions 


Pn{B\^,k) 


ki m HbGB r{nb + C/fc) 
(fc-fcs)! r{n + 0[r{ilk)Y^ ’ 


( 1 ) 


where ks ^ k denotes the number of blocks present in the partition B and 
nh is the size of block b. The limit as fc —>■ oo is well defined and known as 
the Ewens process (a.k.a. Chinese Restaurant process (CRP)), see for instance 
mm- Given such a partition B, a sequence of n-dimensional observations 
Xi G K", i = 1,... ,d, is arranged as columns of the (n x d) matrix X, and this 
X is generated from a zero-mean Gaussian distribution with covariance matrix 

Xb = Bo + B Xi, with cov{Xir, Xjs\B) = SijXg^^ -G B,jXi^^. (2) 


Xq denotes the {d x d) within-class covariance matrix and Xi the (d x d) between- 
class matrix, respectively, and 6ij denotes the Kronecker symbol. Since the par¬ 
tition process is invariant under permutations, we can always think of B being 
block-diagonal. For spherical covariance matrices (i.e. scaled identity matrices), 
Xq = aid, Xi = 13Id, the covariance structure reduces to Xb = In®OiId + B®(3Id 
= (a/„ -I- I3B) ® Id =: Xb ® Id, with cov(Xir, Xjs\B) = (a6ij -G l3B^)6rs. Thus, 
the columns of X contain independent n-dimensional vectors Xi G M" distributed 
according to a normal distribution with covariance matrix 


Xb — otln -G l3B. 


( 3 ) 


Further, the distribution factorizes over the blocks b G B. Introducing the symbol 
ib := {i : i G b} defining an index-vector of all objects assigned to block 6, the 
joint distribution reads 


p{X,B\a,(3,^,k) 


Pnm,k) 

■ rifeGs 


( 4 ) 


where rib is the size of block b and l„j a nb-vector of ones. In the viewpoint 
of clustering, n denote the number of objects we want to partition, and d the 
dimension of each object. 


Wishart-Dirichlet Cluster Process: Assume that the random matrix Xnxd fol¬ 
lows the zero-mean Gaussian distribution specified in ([^, with Xq = aid and 
Xi = jlld- Then, conditioned on the partition B, the inner product matrix 
K = XX*- /d follows a (possibly singular) Wishart distribution in d degrees of 
freedom, K ^ yVd{XB), as was shown in [^. If we directly observe the dot prod¬ 
ucts K, it suffices to consider the conditional probability of partitions Pn{B\K): 


Pn{B\K,a,l3,^,k) oeWd{K\XB) ■ Pn{B\^,k) 

oc \Xb\~^ exp {-^tiiXg^K)) ■ PniB\^,k) 


( 5 ) 
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Information Loss: Note that we assumed that there exists a matrix X with 
K = XX*" jd such that the columns of X are independent copies drawn from 
a zero-mean Gaussian in K": x ~ N{fj, = On,X = Xb)- This assumption is 
crucial, since general mean vectors correspond to a non-central Wishart model 
[S], which can be calculated analytically only in special cases, and even these 
cases have a very complicated form which imposes severe problems in deriving 
efficient inference algorithms. 

By moving from vectors X to pairwise similarities K and from similarities to 
pairwise distances D, there is a lack of information about geometric transfor¬ 
mations: assume we only observe K without access to the vectorial representa¬ 
tions Xnxd- Then we have lost the information about orthogonal transformations 
X -(r- XO with OO* = Id, i.e. about rotations and reflections of the rows in X. If 
we only observe ZI, we have additionally lost the information about translations 
of the rows X •(— X -|- -I- v € 

The models above imply that the means in each row are expected to con¬ 
verge to zero as the number of replications d goes to infinity. Thus, if we 
had access to X and if we are not sure that the above zero-mean assumption 
holds, it might be a plausible strategy to subtract the empirical row means, 
Xnxd ^ Xnxd — {1/d)Xnxd1d1d^ and then to construct a candidate matrix K 
by computing the pairwise dot products. This procedure should be statistically 
robust if d ^ n, since then the empirical means are probably close to their ex¬ 
pected values. Such a corrected matrix K fulfills two important requirements for 
selecting candidate dot product matrices: 

First, K should be “typical” with respect to the assumed Wishart model with 
/X = 0, thereby avoiding any bias introduced by a particular choice. Second, the 
choice should be robust in a statistical sense: if we are given a second observation 
from the same underlying data source, the two selected prototypical matrices Ki 
and K 2 should be similar. For small d, this correction procedure is dangerous 
since it can introduce a strong bias even if the model is correct: suppose we are 
given two replications from N{fj, = 0n,X = Sb), i.e. d = 2. After subtracting 
the row means, all row vectors lie on the diagonal line in K^, and the cluster 
structure is heavily distorted. 

Consider now the case where we observe K without access to X. For “correct¬ 
ing” the matrix K just as described above we would need a procedure which 
effectively subtracts the empirical row means from the rows of X. 
Unfortunately, there exists no such matrix transformation that operates directly 
on K without explicit construction of X. It is important to note that the “usual” 
centering transformation K ^ QKQ with Qij = Sij — ^ as used in kernel PC A 
and related algorithms does not work here: in kernel PCA the rows of X are 
assumed to be i.i.d. replications in Consequently, the centered matrix Kc is 
built by subtracting the column means: Xnxd t— Xnxd — (l/fT-)lnl^A^nx(i and 
Kc = XX* = QKQ. Here, we need to subtract the row means, and therefore 
it is inevitable to explicitly construct X, which implies that we have to choose 
a certain orthogonal transformation O. It might be reasonable to consider only 
rotations and to use the principal components as coordinate axes. This is es- 
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sentially the kernel PCA embedding procedure: compute K^. = QKQ and its 
eigenvalue decomposition K^. = VAV*, and then project on the principal axes: 
X = V. The problem with this vector-space embedding is that it is statisti¬ 
cally robust in the above sense only if d is small, because otherwise the directions 
of the principal axes might be difficult to estimate, and the estimates for two 
replicated observations might highly fluctuate, leading to different column-mean 
normalizations. Note that this condition for fixing the rotation contradicts the 
above condition n that justifies the subtraction of the means. Further, col¬ 
umn mean normalization will change the pairwise dissimilarities Dij (even if the 
model is correct!), and this change can be drastic if d is small. 

The cleanest solution might be to consider the distances D (which are either 
obtained directly as input data, or can be computed as Dij = Ka + Kjj — 2Kij ) 
and to avoid an explicit choice of K and X altogether. Therefore, one encodes 
the translation invariance directly into the likelihood, which means that the lat¬ 
ter becomes constant on all matrices K that fulfill Dij = Ka + Kjj — 2Kij. The 
information loss that occurs by moving from vectors to pairwise similarities and 
from similarities to pairwise distances is depicted in Fig. 



Fig. 2. Information loss that occurs by moving from vectors X to pairwise distances 
D. By moving from X to pairwise similarities K, information about rotation of the 
vectors is lost, by moving from K to D, information about translation is lost. One can 
reconstruct a whole equivalence class of K matrices (four examples are bordered in 
red) from one distance matrix D, i. e. the reconstruction of a similarity matrix K from 
D is not unique, as there is a surjective mapping from a set of K matrices to D. 


Translation-invariant Wishart-Dirichlet Cluster Process: A method which works 
directly on distances has been discussed in mm) as an extension of the 
Wishart-Dirichlet Cluster Process. These methods cluster static distance data, 
and no access to vectorial data is required. The model presented in |32] tack¬ 
les the problem if we do not directly observe K, but only a matrix of pairwise 
Euclidean distances D. In the following, the assumption is that the (suitably 
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pre-processed) matrix D contains squared Euclidean distances with components 

Dij = Kii -\- Kjj — 2Kij. ( 6 ) 

A squared Euclidean distance matrix D is characterized by the property of being 
of negative type, which means that x^Dx = — ^x^Kx < 0 for any x : a:‘l = 0. 
This condition is equivalent to the absence of negative eigenvalues in Kc = 
QKQ = —^QDQ. The distribution of D has been formally studied in [18], 
Eq. (3.2), where it was shown that if K follows a standard Wishart generated 
from an underlying zero-mean Gaussian process, K ^ Wdi^B)j ~D follows a 
generalized Wishart distribution, —D ~ W(1,2A'b) = W{l,—A) defined with 
respect to the transformation kernel IK = 1, where Aij = Sbu -f ^Bjj — ‘^^Bij- 
To understand the role of the transformation kernel it is useful to introduce 
the notion of a generalized Gaussian distribution with kernel IK = 1: X ~ 
N{1, fi, E). Eor any transformation L with LI = 0, the meaning of the general 
Gaussian notation is: LX ~ N{Lpt, LEL*). It follows that under the kernel IK = 
1, two parameter settings (/x^, Ei) and (/X 2 , ^ 2 ) are equivalent if L(/Xj^ — /X 2 ) = 0 
and L{Ei — E 2 )L* = 0, i.e. if /x^—/X 2 G 1, and {E 1 — E 2 ) G {InV* : v G K”}, 
a space which is usually denoted by sym^(l 0 K”). It is also useful to introduce 
the distributional symbol K ~ yV’(IK, E) for the generalized Wishart distribution 
of the random matrix K = XX* when X ~ N{K, 0, A). The key observation in 
|18j is that Dij = Ka + Kjj — 2Kij defines a linear transformation on symmetric 
matrices with kernel sym^(l G K") which implies that the distances follow a 
generalized Wishart distribution with kernel 1: —D ^ W(l, 2Eb) = W(l, —A) 
and 

Aij = Ebu + EBjj — . (7) 

In the multi-dimensional case with spherical within- and between covariances we 
generalize the above model to Gaussian random matrices X ~ X(/x, Eb ^ Id)- 
Note that the d columns of this matrix are i.i.d. copies. The distribution of 
the matrix of squared Euclidean distances D then follows a generalized Wishart 
with d degrees of freedom —D ~ >Vd(l,—A). This distribution differs from a 
standard Wishart in that the inverse matrix W = E^^ is substituted by the 
matrix W = W — and the determinant | • | is substituted by a 

generalized det(-)-symbol which denotes the product of the nonzero eigenvalues 
of its matrix-valued argument (note that W is rank-deficient). The conditional 
probability of a partition then reads 

P{B\D, .) cx W{-D\l,-A) ■ P„(L|e, k) 
cx det(W)5 exp (ftr(WLi)) • P„(P|^,fc). 

and the probability density function (which serves as likelihood function in the 
model) is then defined as 

f{D) cx det(W)i exp (|tr(WP)^. 


( 9 ) 


Note that in spite of the fact that this probability is written as a function of 
W = it is constant over all choices of Sb which lead to the same Z\, 

i.e. independent under translations of the row vectors in X. For the purpose of 
inferring the partition B, this invariance property means that one can simply 
use a block-partition covariance model Xb and assume that the (unobserved) 
matrix K follows a standard Wishart distribution parametrized hy Eb- We do 
not need to care about the exact form of K, since the conditional posterior for 
B depends only on D. Extensive analysis about the influence of encoding the 
translation invariance into the likelihood versus the standard WD process and 
row-mean subtraction was conducted in [32) . 


3 A Time-evolving Translation-invariant 
Wishart-Dirichlet Process 

In this section, we present a novel dynamic clustering approach, the time-evolving 
translation-invariant Wishart-Dirichlet process (Te-TiWD) for clustering dis¬ 
tance data that is available at multiple time points. In this model, we assume 
that pairwise distance data Dt with I < t < T is available over T time points. 
At every time point t all objects are fully exchangeable, and the number of data 
points may differ at the different time points. This model clusters data points 
over multiple time points, allowing group memberships and the number of clus¬ 
ters to evolve over time by addition, deletion or change in existing clusters. The 
model is based on the static clustering model that was proposed in |32j which 
is not able to account for a time structure. Note that our model completely 
ignores any information about the identities of the data points across the time 
points, which makes it possible to cluster different objects over time. Table 
summarizes notations which we will use in the following sections. 


3.1 The Model 

The aim of the proposed method is to cluster distance data Dt at multiple time 
points, for 1 < t < T. For every time point under consideration, t, we obtain a 
distance matrix Dt and we want to infer the partition matrix Bt, by utilizing 
the partitions from adjacent time points. By using information from adjacent 
time points, we expect better clustering results than clustering every time point 
independently. At every time point, the number of data points may differ, and 
some clusters may die out or evolve over time. The assumptions on the data are 
the following: 

Assumption 1 Given a partition Bt, a sequence of the assumed underlying 
nt-dimensional vectorial observations Xt^ S i = l,...,dt, are arranged as 
columns of the {nt x dt) matrix Xt, i.e. Xt^, ...,Xt^^ Af {0, Eb^), with covari¬ 

ance matrix 


^Bt — + ^At ■ 


( 10 ) 
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Dt 

distance matrix at time point t (cf. (6|) 

At 

A matrix at time point t (cf. (7l) 

Bt 

partition matrix at time point t 

kbt 

number of blocks bt present 
in the partition Bt 

nbt 

the size of block bt 

(-0 

size of block bt without object 1 

nt 

number of data points 
present at the t-th time point 

At 

kbf X kb^ matrix 


the between-class variance of 

block i and block j 


is defined as (Bi, B 2 ,..., Bt) 

[A.lLi 

is defined as (Ai, A 2 ,..., At) 

p{[Bt\T=.) 

= p(Bi)p(B2|Bi)....p(Bt|Bt-i) 
defines a first-order Markov chain 

p{[MT=^) 

= p(Ai)p(A2|Ai)....p(At| At-i) 
defines a first-order Markov chain 

[B]t- 

B matrices at all time points 
except at time point t 


Table 1. Notations used throughout this manuscript. 


Covariance matrix Sst- In the static clustering method, the underlying vecto¬ 
rial data was assumed to be distributed according to a Gaussian distribution 
with mean 0, xi,Xd A/'(0, Sb) with Sb '■= aln + PB, (cf. (§), where PB 
describes the between class covariance matrix. As P denotes a scalar, all clus¬ 
ters in the static clustering are equidistant (as demonstrated in left of Fig. [^. 
To model time evolving data, we need a more flexible between-class covariance 
matrix which allows that cluster centroids have different distances to each 
other. These full matrices are necessary for a time-evolving clustering model, 
as the clusters are coupled over the different time-points due to the geometric 
information of the clusters, and this coupling can only be captured by modeling 
a richer covariance. Hereby G is obtained in the following way 

Ea, = ZtAtZj ( 11 ) 

with Zt G {0,1}"*^^*’*. The matrix Zt associates an object with one out of 
clusters. As every object can only belong to exactly one cluster, Zt has a single 
element of 1 per row. In Fig. [^we demonstrate examples of PB and Ea^ as well 
as the corresponding cluster arrangements which the matrices imply. 
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|3B = 




p„ 

P12 

Pl3 

P21 

P22 

P23 

P31 

P32 

P33 


P 




P 




P 


Fig. 3. Different models for clustering. Left: example of the block diagonal structure 
of I3B for three blocks, all cluster centroids must be equidistant. Right: example of 
the full covariance matrix (for better readability, we drop the time index t in the 
figure), which allows differing distances between cluster centroids. 


Note that EAt is a more general version of /3i3: 


Lemma 1. Ea^ = /3B iff At,- 


0 iiiffj 
(3 \ii= j 


Prior over the block matrices Bt . The prior over the block matrices Bt is defined 
in the following way. The prior for Bt in one epoch is the Dirichlet-Multinomial 
prior over partitions as in Q. Using the definition of the conditional prior over 
clusters as defined in [2], we extend this idea to the prior over partitions. In a 
generative sense, the same idea is used to generate a labeled set of partitions 
and then we forget the labels to get a distribution over partitions. By we 

denote the size of block bt-i if the corresponding block is present at time point t 
as well. We consider the following generative process for a finite dynamic mixture 
model with k mixtures (cf. [2], Eqs. (4.5), (4.6) and (5.9)): for each time point 
t, we generate mixing proportions = (tt^i, ..., ntk) from a symmetric Dirichlet 
distribution Dii{^/k + nt-i, ...,^/k + nt-i). As in the static case, we generate a 
label sequence from a multinomial distribution and forget the labels introducing 
the random partition Bt- Integrating out the conditional distribution for 
Dirichlet-Multinomial prior over partitions, given the partitions in the previous 
time point (t — 1), can be written as: 


PnABt\Bt-i,^,k) = 


k\ 


A(C + nt-i) ribtgBt + ^/k + ribj 


(k - fcsj! Pint -b ^ -b nt-i) Ilh.GBt B{^/k + J 


( 12 ) 


Note that (12) defines a partition process as described in section i with Pnt 
being the marginal distribution of Pnt-n S'lid it also is an exchangeable process, 
as each P^^ is invariant under permutation of object indices. 


Prior over At . The prior over the At matrices is given by a Wishart distribution, 
P{At\At-i) ~ yVd{At\At-i) and S'o := P{Ai) = Wd(Ai|/fcjJ. The degrees of 
freedom d influences the behavior of the Wishart distribution: a low value for d 
allows drastic changes in the clustering structure, a high value for d allows fewer 
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changes. We also have to consider that the size of At-i, and At+i might 
differ, as it is possible that the number of clusters in every epoch is different. 
Therefore, we consider the following two cases: 

1 ) if there are more blocks at time t — 1 than at time t, i. e. ki,^_^ > ki,^: 

delete corresponding rows and columns in At-i. With A[_i we denote the 
“reduced” matrix. Then it holds that At ~ WdiAt_j^) 


2 ) if there are fewer blocks at time t—1 than at time t, i. e. if kbt_t < kb^'. 
first, draw a kb^_^ x kb^_i matrix A't from A't ~ yVd{At-i). Second, augment 
as many new rows and columns as needed to obtain the full positive definite 
X (fcbj matrix At. We can draw the additional rows and columns of At 
in the following way (see [S] for details): 


\A 21 A 22 j 


(13) 


with All := A't e A 21 e A 12 S and 

A 22 G R. One obtains A 12 , A 21 and A 22 in the following way: 

A12IA11 ~ A/'(0, All 0 s) 

A22.1 ~ y^i{d—kb^,s) ( 14 ) 

A22 = A22.I + A2iA^^ Ai2 

where s denotes a scalar value and d the degrees of freedom of the Wishart 
distribution >Vd(^t-i)- 


A graphical depiction of the generative model of Te-TiWD is given in Fig. 


Posterior over Bt and At. With the likelihood for every time point, analogous 
to Eq. and the prior over At and Bt, we can now write down the equations 
for the posterior over Bt and At for all time points t G {1,2, ...,T}: 

p{[Bt]^,[At]^\[Dt]^,*) cx l[W^{Dt\l,At)P{[Bt]J)P{[At]^) (15) 

T 

= ndet(lW)^exp (^tT{WtDt))p{[Bt]J)P{[At]^) (16) 

with Wt := Wt — {l^Wtl)~^Wtl'l^Wt, where Wt ■= (cf. Q and ([^). 


MCMC sampling for posterior inference. For applying MCMC sampling 
to sample from the posterior, we look at the conditional distributions. Consider 
the conditional distributions at each time point t\ 

p{Bt,At\Dt, [B]t-, [A]t_,*) (X 

WfiDt\l,At)P{Bt\Bt-i)PiBt+i\Bt)PiAt\At-i)PiAt+i\At) 


(17) 
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Fig. 4. Depiction of the generative model of Te-TiWD with all hyper-parameters 
and parameters. Shaded circles correspond to fixed or observed variables, unshaded to 
latent variables. Arrows that point to a box mean that the parameters apply to all the 
variables inside the box, whereas arrows that directly point to a variable only apply 
to that single variable. Dt denote the distance matrices observed at different points in 
time, Bt denote the inferred partitions and At the between class covariance matrices 
at different time points 1 < t < T. 


Posterior sampling for Bt- The posterior sampling involves sampling assign¬ 
ments. As we are dealing with non-conjugate priors in 0 , we use a Gibbs 
sampling algorithm with m auxiliary variables as presented in [20j . We consider 
the infinite model with k —>■ oo. The aim is to assign one object I in epoch t to 
either an existing cluster c, a new cluster that exists at epoch t — 1 or epoch 
t -b 1 or a totally new cluster. The prior probability that object I belongs to an 
exisiting cluster c at time point t is 

P{1 = c\Bt-i)P{Bt+i\Bt) cx (18) 

There exist four different prior probabilities of an object I belonging to a new 
cluster Cnew at time point t, which are summarized in table 


Cnew exists at P(1 = Cnew|Bt-l)P(Bt+l|Bt) 


both time points t — 1 and t + 1'. 

oc • ^) • nct+i (19) 

time point t — 1 but not at time point t -|- 1: 

« ^ • nct i (20) 

time point t Al but not at time point t — 1 

« It • nct+i (21) 

neither t — 1 nor t A 1, 

(i.e. 1 belongs to a completely new cluster) 

« ^ (22) 


Table 2. Table of prior probabilities. 
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Metropolis-Hastings update steps. In every time point, we need to sample j5 
values in the between-class variance matrix ■ To find the /3 values within 
one epoch, we sample the whole “new” At matrix, denoted by , with 

a Metropolis-Hastings algorithm (see [23]). With we denote the initial 

At matrix. As proposal distribution we chose a Wishart distribution, and for 
the prior we chose a Wishart distribution as well, leading to P{At^^^\At^i^) ~ 
and ^ 


Hyperparameters and Initialization. Our model includes the following hyperpa¬ 
rameters: the scale parameter a, the number k of clusters, the Dirichlet rate 
the degrees of freedom d and a scale parameter s. The model is not sensitive to 
the choice of s, and we fix s to 1. a is sampled from a Gamma distribution with 
shape and scale parameters r and m. For the number k of clusters, our frame¬ 
work is applicable to two scenarios: we can either assume k = oo which results 
in the CRP model, or we fix fc to a large constant which can be viewed as a 
truncated Ewens process. As the model does not suffer from the label switching 
problem, initialization is not a crucial problem. We initialize the block size with 
size 1, i. e. we start with one cluster for all objects. The Dirichlet rate ^ only 
weakly influences the likelihood, and the variance only decays with l/log(nt) 
(see m)- In practice, we should not expect to reliably estimate Rather, we 
should have some intuition about maybe guided by the observation that under 
the Ewens process model the probability of two objects belonging to the same 
cluster is 1/(1 -I- ^). We can then either define an appropriate prior distribution, 
or we can fix Due to the weak effect of ^ on conditionals, these approaches 
are usually very similar. The degrees of freedom d can be estimated by the rank 
of K, if it is known from a pre-processing procedure. As d is not a very critical 
parameter (all likelihood contributions are basically raised to the power of d), d 
might also be used as an annealing-type parameter for freezing a representative 
partition in the limit for d —> oo. 


Pseudocode. A pseudocode of the sampling algorithm is given in Algorithm 1. 


Algorithm 1 Pseudocode Te-TiWD 


for t = 1 to iteration do 
for t = 1 to T do 
for j = 1 to nt do 

Assign one object to an existing cluster or a new one using Eqs. (17l-(22l 
Update fcfjt 

end for 
end for 

for t = 1 to T do 

Sample new At matrix using Metropolis-Hastings 

end for 
end for 
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Complexity. We define one sweep of the Gibbs sampler as one complete update 
of [Bt,At). The most time consuming part in a sweep is the update of Bt by 
re-estimating the assignments to blocks for a single object (characterized by a 
row/column in I?t), given the partition of the remaining objects. Therefore we 
have to compute the membership probabilities in all existing blocks (and in a new 
block). Every time a new partition is analyzed, a naive implementation requires 
0{v?) costs for computing the determinant of Wt and the product WtDt- In one 
sweep we need to compute such probabilities for rit objects, summing up to 
costs of 0{rAkb^). This suggests that the scalability to large datasets can pose a 
problem. In this regard we plan to address run time in future work by investigat¬ 
ing the potential of variational methods, parallelizing the MCMC sampler and 
by updating parameters associated with multiple time points simultaneously. 


Identifiability of clusters. In some applications, it is of interest to identify and 
track clusters over time. For example by grouping newspaper articles into topics 
it might be interesting to know which topics are present over a long time period, 
when a new topic becomes popular and when a former popular topic dies out. 
Due to the translation-invariance of our novel longitudinal model, we additionally 
need a cluster mean to be able to track clusters over the time course. To estimate 
the mean of the clusters we propose to embed the “overall” data matrix D* G 
^NxN ]Y .— Ylt=int that contains the pairwise distances between all 
objects over all time points into a vector space, using kernel PCA. We hrst 
construct a positive semi-definite matrix K* which fulfills D*^ = K*^+K*j—2K*j. 
For correcting K*, we compute the centered matrix Kf = Q*K*Q* with Q*j = 
6ij — T, As a next step, we compute the eigenvalue decomposition of itf*, i.e. 
K* = VAV'^ and then project on the principal axes X* = VA^, i.e. we use the 
principal components as coordinate axes. By embedding the distances D* into 
a vector space, the underlying block structure might be distorted (see Fig. [^. 
As our aim is to find the underlying block structure, it is hence infeasible to 
embed the data for clustering. But, for tracking the clusters, we just need to 
find the mean of an already inferred block structure, i.e. we embed the data not 
for grouping data points, but for finding a mean of an already assigned partition 
that allows us to track the clusters over time. We embed all objects together 
and choose the same orthogonal transformations for all objects, which enables 
identifiability of cluster means over the time course. This preprocessing step 
is only necessary if one is interested in the identifiability of clusters, and X* 
needs only to be computed once outside the sampling routine. Since computing 
X* is computationally expensive, it is done only once as a preprocessing step 
if required. Computing X* within the sampling routine would slow down our 
sampler signihcantly. 
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4 Experiments 

4.1 Synthetic Experiments 


Well separated clusters. In a first experiment, we test our method on simu¬ 
lated data. We simulate data in two ways, first we generate data points accord¬ 
ingly to the model assumptions, and secondly we generate data independent of 
the model assumptions. We start with a small experiment where we consider five 
time points each with 20 data points per time point in 100 dimensions, i. e. we 
consider a small data set size and large dimension problem. 


Data generation. The data is generated (according to the model assumptions) 
in the following way: for the first time point, a random block matrix Bi of 
size ni = 20 is sampled with = 3 (i.e. we generate 3 blocks at time point 
1). A fcfej X matrix Ai is sampled from Wd{Ikb.^) and Bi is filled with the 
corresponding /? values from Ai, which leads to the ni x ni matrix Next, 
di = 100 samples from A/'(0„, Sbi) are drawn with = aim + where 
q: = 2, and stored in the (ni x di) matrix Xi. By choosing a = 2, we create well 
separated clusters. The similarity matrix Ki = XiX'l' is computed and squared 
distances are stored in matrix Di. For the following time points t > 1, the 
partition for the block matrix Bt of size nt is drawn from a Dirichlet-Multinomial 
distribution, conditioned on the partition at time point t — 1. A new At matrix 
is sampled from 'Wd{At_i). If the number of blocks in time points t and t—1 are 
different, we sample At according to Eq. (14). dt samples from A/’(0„j, are 
drawn with Sb^ = aim + The pairwise distances are stored in the matrix 
Dt- A PCA projection of this data is shown in Fig. [^for illustration. 
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Fig. 5. PCA projections of five time points with three well separated clusters per time 
point. Numbers and colors correspond to true labels. 


Experiments. We perform four illustrative experiments for well-separated data: 

a) 500 Gibbs sweeps are computed for the Te-TiWD cluster process (after a 
burn-in phase of 250 sweeps). We check convergence of the algorithm by an¬ 
alyzing the trace of the number of blocks kb^. during sampling. On this trace 
plot we observe after how many sweeps the sampler stabilizes (the number of 
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sweeps depends on the size of the data set). We observe a remarkable stabil¬ 
ity of the sampler (compared to the usual situations in traditional mixture 
models), which follows from the fact that no label-switching can appear. Fi¬ 
nally, we perform an annealing procedure to freeze a certain partition. Here, 
d is used as an annealing-type parameter for freezing a representative par¬ 
tition in the limit d —>■ oo. On a standard computer, this experiment took 
roughly 4 minutes, and the sampler stabilizes after roughly 50 sweeps. As 
the ground truth is known, we can compute the adjusted rand index as an 
indicator for the accuracy of the Te-TiWD model. We repeat the clustering 
process 50 times. The result is shown in form of a box plot (Te-TIWD) in 
Fig.|6] 

b) In order to compare the performance of the time-evolving model (Te-TiWD) 
to baseline models, we also run the static probabilistic clustering process 
as well as hierarchical clustering models (Ward, complete linkage and single 
linkage) on every time point separately and compute the averaged accuracy 
over all time points. For the comparison to the static probabilistic method 
[32], we use the same set-up as for Te-TiWD, we run 500 Gibbs sweeps with 
a burn-in phase of 250 sweeps and repeat it for 50 times. For the hierarchical 
methods, the resulting trees are cut at the number of clusters found by the 
nonparametric probabilistic model. Accuracy is computed for every time 
point separately, and then averaged over all time points. In this scenario, 
the static clustering models performs almost as well as the time-evolving 
clustering, see Fig. as expected in such a setting where all groups are well 
separated at every single time point. 

c) As a further comparison to a baseline dynamic clustering model, we embed 
the distances into a Euclidean vector space and run a Gaussian dynamic 
clustering model (Te-Gauss) on the embedded vectorial data. As the clusters 
are well separated, embedding the data and clustering on vectors works well, 
as shown in box plot “Te-Gauss” in Fig. 

d) As a last comparison we evaluate a pooled clustering over all time points. For 
this experiment, we not only need the pairwise distances at every single time 
point, but also the pairwise distances of objects across all time points. The 
number of sweeps and repetitions remains the same as in the experiments 
above. We conduct one clustering over all objects of all time points, and 
after clustering, we extract the objects belonging to the same time point 
and compute the rand index on every time point separately. This experiment 
shows worse results (see box plot “pooled” in Fig.[^, which can be explained 
as follows: by combining all time points to one data matrix, new clusters over 
all time points are found, this means clusters are shifted and objects over 
time are grouped together, introducing new clusters by reforming boundaries 
of old clusters. These new clusters inhibit objects to group together which 
would group together at single time points, destroying the underlying “true” 
cluster structure. 
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Fig. 6. We compare our new dynamic model (Te-TiWD) with baseline methods: static 
clustering as in m, combined clustering over all time points (pooled), a Gaussian 
time-evolving clustering model (Te-Gauss) as well as to Ward, complete linkage and 
single linkage. In this experiment with three well separated clusters per time point, all 
methods perform very well, except for pooling the data. 


Highly overlapping clusters. For a second experiment, we generated data 
in a similar way as above, but this time we create 5 highly overlapping clusters 
each with 200 data points per time point in 40 dimensions. A PCA projection 
of this data is shown in Fig. On a standard computer, this experiment took 
roughly 3 hours, and the sampler stabilizes after roughly 500 sweeps. Again, we 
compare the performance of the translation-invariant time-evolving clustering 
model with static state-of-the-art probabilistic and hierarchical clustering mod¬ 
els which cluster on every time point separately and a time-varying Gaussian 
clustering model on embedded data (Te-Gauss). For highly overlapping clusters, 
the new dynamic clustering model outperforms the static probabilistic clustering 
model [32], and the hierarchical models (Ward, complete linkage, single linkage) 
fail completely. Further, our new model Te-TiWD outperforms the dynamic, 
vectorial clustering model (Te-Gauss), demonstrating that embedding the data 
into a Euclidean vector space yields worse results than working on the distances 
directly. We tested the statistical significance with the Kruskal-Wallis rank-sum 
test and the Dunn post test with Bonferroni correction for pairwise analysis. 
These tests show that Te-TiWD performs significantly better than all clustering 
models we compared to. Results are shown in Fig. 


Data generation independent of model assumptions. We also generate 
data in a second way which is independent of the model assumptions to demon¬ 
strate that the performance of our model Te-TiWD is independent of the way the 
data was generated. To demonstrate this, we repeat the case of highly overlap¬ 
ping clusters over 5 time points and generate data in the following way: dynamic 
Gaussian clusters are generated over a period of 5 time points. At each time point 
five clusters are generated. 200 data points are available at every time point and 
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Te-TiWD Te-Gauss method by Ward complete single 
by [2] [32] linkage linkage 

time-evolving methods -static methods- 


Fig. 7. We compare our new model (Te-TiWD) with baseline methods on synthetic 
data for five highly overlapping clusters. Our model significantly outperforms all base¬ 
line methods. 


randomly split into 5 parts, every part representing the number of data points 
per cluster. For consecutive time points, the number of data points per cluster 
is sampled from a Dirichlet-Multinomial distribution. Every cluster is sampled 
from a Gaussian distribution with a large variance, resulting in highly over¬ 
lapping clusters. Between time steps, the cluster centers move randomly, with 
relocations sampled from the same distribution. Finally, at every time point, the 
model-based pairwise distance matrix Dt is computed, resulting in a series of 
moving distance matrices. On this second synthetic data set, Te-TiWD performs 
significantly better than all baseline methods as well, as shown in Fig. Note 
that for the comparison with the Gaussian dynamic clustering model (Te-Gauss) 
we first embed the distances Dt into vectorial data Xt and do not work on the 
simulated vectorial data directly, to obtain a fair comparison. 








Fig. 8. PCA projections of five time points of simulated data with five highly over¬ 
lapping clusters. Numbers and colors correspond to true labels. 
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Te-TiWD Te-Gauss method by Ward complete single 
by [2] [32] linkage linkage 

time-evolving methods -static methods- 


Fig. 9. We compare our new model (Te-TiWD) with baseline methods on synthetic 
data which is generated independent of the model assumptions for hve highly over¬ 
lapping clusters. We observe that our method significantly outperforms all baseline 
methods. 


4.2 Analysis of Brain Cancer Patient based on Electronic Health 
Records (EHR) 


We apply our proposed model to a dataset of clinical notes from brain cancer 
patients at Memorial Sloan Kettering Cancer Center (MSKCC). Brain cancer 
patients make up 1.4% of all cancer patients, annually. Survival is highly variable, 
depending on age, gender, cancer subtype, and progression when caught, but on 
average 33% of patients survive the first five years. 

As a first step, we partition a total of 195,297 sentences from 3,403 electronic 
health records (EHR) from 704 MSKCC brain cancer patients into groups of 
similar vocabulary. This is done by treating sentences as binary vectors with 
non-zero entries corresponding to vocabulary, and obtaining a similarity measure 
using ranked neighborhood comparisons [3T]. Sentences are clustered using this 
similarity measure with the Louvain method [8]. Using these sentence clusters 
as features, we obtain patient similarities with the same ranked neighborhood 
comparison method. We partition the patients documents into windows of one 
year each, and obtain three time points where enough documents are available 
to compute similarities between patients. At each year, we represent a patient 
with a binary vector whose length is the number of sentence clusters. A non¬ 
zero entry corresponds to an occurrence of that sentence cluster in the patient’s 
corpus during the specified time period. 

In the first year, we have 704 patients, in the second year 170 and in the third 
year 123 patients. This data set has specific features which make our model 
particularly suitable to cope with this kind of data. First, the number of patients 
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Fig. 10. Clusters over all 3 years of brain cancer patients. We find ten different cluster 
chains where 2 remain over al three years, 3 vanish after the second year and one new 
cluster comes up in year 2 and remains in the third year. Size of the tokens denote 
the cluster size, i.e. the number of patients per cluster. Note that patients can change 
clusters, so a cluster decreasing in size or disappearing does not necessarily mean those 
patients die or leave the hospital. 


differs in every year. Second, patients disappear over the time course, either due 
to death or due to leaving the hospital. Third, patients do not necessarily need 
to have a document every year, so a patient can be absent from year 2 and 
appear in year 3. This gap occurred a total of 31 times in our data set. This 
is why our flexible model is very well suited for this problem, as the model can 
deal with changing numbers of objects and changing number of clusters in every 
year, clusters can disappear or reappear, as well as patients. The result of our 
clustering model is shown in Fig. 10 On a standard computer, this experiment 
took roughly 6 hours, and the sampler stabilizes after roughly 500 sweeps. 


We observe ten different cluster chains over the time series. Note that patients 
can switch cluster chains over the years, as the tumor progresses, the status of 
the patient may change, resulting in more similarities to a different cluster chain 
than the year before. To analyze the results of the method, we will discuss the 
most and least deadly clusters in more detail, as analyzing all subtleties between 
clusters would be out of the scope of this paper. 

Cluster chain 1 is the most deadly cluster, with a death rate of 80%. Additionally, 
it only appears in the first year. Word clouds representing the sentence clusters 
of this patient group are shows in Fig. [m We can see that these patients are 
having seizures which indicates that the brain cancer is especially malicious. 
They also show sentence clusters about two types of blood cancers, b cell and 
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Fig. 11. Word clouds representing five sentence clusters that are observed in patients 
from cluster chain 1, the most deadly cluster. They describe patients that have blood 
cancers (lymphomas) in addition to brain cancer. 


mantle cell lymphoma, and prescription of cytarabine^ which treats these cancers. 
This combination of blood and brain cancers could explain why this cluster chain 
is so deadly. 

Cluster chain 5 is the least deadly chain with a death rate of 42%. Word clouds 
representing sentence clusters for this patient group are show in Fig. These 
clusters consist of mainly ’’follow-up” language, such as checking the patients’ 
gait, speech, reflexes and vision. The sentence clusters appear to indicate positive 
results, e.g. ’’Normal visual fields are intact”, and ’’Patient denies difficulty with 
speech, language, balance or gait” are two prototype sentences representing two 
sentence clusters that appear in this chain. Furthermore, there is a sentence 
cluster with prototype sentence ”no evidence for progression,” indicating that 
these patient’s cancers are in a manageable state. 


evidence gait balance 

progression reccurrence dGniGS speech 

without difficulty 

anklG knGG 
jerks reflexes 

Fig. 12. Word clouds representing four sentence clusters that are observed in patients 
from cluster chain 5, the most positive cluster. These sentence clusters are ’’follow-up” 
language, such as checking reflexes or the ability to walk and see well. This indicates 
that the patients are in a relatively stable state under regular observation. 


Modeling patients over time provides important insights for automated analyses 
and medical doctors, as it is possible to check for every patient how the state of 
the patient as represented by the cluster membership changes over time. Also, if a 
new patient enters the study, one can infer, based on similarity to other patients, 
how to classify and possibly treat this patient best or to suggest clinical trials for 
each patient. Such clustering methods therefore make an important step towards 
solving the technical challenges of personalized cancer treatment. 
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5 Conclusion 

In this work, we propose a novel dynamic Bayesian clustering model to cluster 
time-evolving distance data. A probabilistic model that is able to handle non- 
vectorial data in form of pairwise distances has the advantage that there is no 
need to embed the data into a vector space. To summarize, our contributions 
in this work are five-fold: i) We develop a dynamic probabilistic clustering ap¬ 
proach that circumvents the potentially problematic data embedding step by 
directly operating on pairwise time-evolving distance data, ii) Our model en¬ 
ables to track the clusters over time, giving information about clusters that die 
out or emerge over time, iii) By using a Dirichlet process prior, there is no need 
to fix the number of clusters in advance, iv) We test and validate our model 
on simulated data. We compare the performance of our new method with base¬ 
line probabilistic and hierarchical clustering methods, v) We use our model to 
cluster brain cancer patients into similar subgroups over a time course of three 
years. Dynamic partitioning of patients would play an important role in cancer 
treatment, as it enables inference from groups of similar patients to an individ¬ 
ual. Such an inference can help medical doctors to adapt or optimize existing 
treatments, assign billing codes, or predict survival times for a patient based on 
similar patients in the same group. 
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